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Abstract 

Background: The ability to design tliermostable proteins is theoretically important and practically useful. Robust 
and accurate algorithms, however, remain elusive. One critical problem is the lack of reliable methods to estimate 
the relative thermostability of possible mutants. 

Results: We report a novel scoring function for discriminating hyperthermophilic and mesophilic proteins with 
application to predicting the relative thermostability of protein mutants. The scoring function was developed based 
on an elaborate analysis of a set of features calculated or predicted from 540 pairs of hyperthermophilic and 
mesophilic protein ortholog sequences. It was constructed by a linear combination often important features 
identified by a feature ranking procedure based on the random forest classification algorithm. The weights of these 
features in the scoring function were fitted by a hill-climbing algorithm. This scoring function has shown an 
excellent ability to discriminate hyperthermophilic from mesophilic sequences. The prediction accuracies reached 
98.9% and 97.3% in discriminating orthologous pairs in training and the holdout testing datasets, respectively. 
Moreover, the scoring function can distinguish non-homologous sequences with an accuracy of 88.4%. Additional 
blind tests using two datasets of experimentally investigated mutations demonstrated that the scoring function 
can be used to predict the relative thermostability of proteins and their mutants at very high accuracies (92.9% 
and 94.4%). We also developed an amino acid substitution preference matrix between mesophilic and 
hyperthermophilic proteins, which may be useful in designing more thermostable proteins. 

Conclusions: We have presented a novel scoring function which can distinguish not only HP/MP ortholog pairs, 
but also non-homologous pairs at high accuracies. Most importantly, it can be used to accurately predict the 
relative stability of proteins and their mutants, as demonstrated in two blind tests. In addition, the residue 
substitution preference matrix assembled in this study may reflect the thermal adaptation induced substitution 
biases. A web server implementing the scoring function and the dataset used in this study are freely available at 
http://www.abl.ku.edu/thermorank/. 
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Background 

Developing thermostable proteins has been a main focus 
of protein engineering because of its theoretical and prac- 
tical significance [1-4]. Recently, computational protein 
design methods have been attracted much attention due 
to their potential cost and time savings over conventional 
directed evolution approaches [3,5,6]. These types of 
approaches utilize information extracted from protein 
sequences and/or 3D structures to predict favorable 
mutations that may enhance protein thermostability. 
Clearly, a key step in such approaches is the development 
of reliable methods for estimating the relative stability of 
possible mutants to identify favorable mutations. Such 
methods may also help better understand the protein- 
folding problem since the ultimate outcome of protein 
folding is a native structure with the lowest free energy 
among many possible structures of a protein. 

A common approach to study the thermostability of 
proteins is to perform comparative studies of the 
sequences and/or structures of (hyper)thermophilic pro- 
teins (HPs) and their mesophilic counterparts (MPs) 
[7-15] because there exists a direct positive correlation 
between the optimal growth temperature (OGT) of an 
organism and the melting temperature of its proteins, a 
key metric of protein thermostability [16,17]. Numerous 
studies have focused on amino acid composition 
changes caused by thermal adaptation at the whole gen- 
ome level [7,14,18]. For example, Zeldovich et al. dis- 
covered that the total concentration of seven amino 
acids (INYWREL) in the proteins of an organism has a 
strong correlation with its OGT [14]. Overall, the pro- 
teins of thermophiles contain more charged and hydro- 
phobic amino acid residues at the expense of polar ones 
[7,14,18]. The observed composition differences have 
prompted the development of predictive models discri- 
minating HPs and MPs [19-21]. For example, Gromiha 
and Suresh applied 12 different classification algorithms 
and the best accuracy achieved reached 89% [21]. 

Several amino acid substitution preference matrices 
have been created based on the sequence alignments of 
thermophilic proteins and their mesophilic homologues 
[22-24]. Analyzing these matrices and comparing 
sequences and structures of HPs and MPs have revealed 
a number of substitution trends potentially affecting 
thermostability [7,8]. Notable features include: an 
increased level of charged residues in hyperthermophilic 
proteins at the cost of polar residues on surface com- 
pared to their mesophilic homologs [23,25,26]; elevated 
levels of proline or (3 -branched amino acids in loops to 
reduce the freedom of coil regions [1,27]; a reduced 
number of residues in coil regions but increases in helix 
runs [28,29]; increased numbers of the high helix-pro- 
pensity residues such as Lys and Glu, etc. [30]; an 



increased compactness of hydrophobic cores resulted in 
enhanced apolar interactions and interior packing 
[30-32]; and reduced deamidation probability by repla- 
cing Gin with Glu and Asn with Asp [33,34]. 

The goal of this study was to develop a scoring func- 
tion for predicting relative thermostability of protein and 
their mutants using an integrated statistical and machine 
learning approach. We used HP/MP orthologs as 
research subjects because they are equivalent to mutants 
with multiple substitutions and, as discussed above, the 
difference between them may encode thermal-adaptation 
mechanisms. Thus a scoring function which can distin- 
guish HP/MP orthologs is presumably able to rank the 
relative stability of a protein and its mutants, a key step 
for designing more thermostable proteins. 

In this study, we first constructed a set of 540 non- 
redundant hyperthermophilic-mesophilic protein ortho- 
log pairs. Since our dataset is significantly bigger than 
previous studies, we then calculated a substitution pre- 
ference matrix using an established approach 
[11,12,22-24]. We used a feature selection procedure 
based on the random forest algorithm to identify 
sequence-based features important to pairwise discrimi- 
nation of hyperthermophilic and mesophilic protein 
orthologs. We then used a hill-climbing algorithm to fit 
a scoring function based on a linear combination of 
these important discriminating features. Finally, we 
applied the scoring function to two experimental data- 
sets to demonstrate that this scoring function can 
indeed be used to rank thermostability of protein 
mutants with high accuracy. 

Methods 

Datasets 

We downloaded all protein sequences of nine organ- 
isms, including four hyperthermophilic and five meso- 
philic organisms (Table 1) from the NCBI http://www. 
ncbi.nlm.nih.gov/. To identify HP-MP ortholog pairs, we 

Table 1 The list of organisms whose proteins were used 
to generate the non-redundant hyperthermophilic 
(upper) and mesophilic (bottom) orthologous pairs 
(adopted from [48]). 



Organism 


Number of proteins OGT (°C) 


Aquifex aeolicus VFS 


1550 


96 




Methanocaldococcus jannaschii DSM 


1786 


85 




Thermotoga maritima MSB8 


1858 


80 




Pyrococcus abyssi GE5 


1898 


103 




Corynebacterium glutamicus ATCC 


2993 


30 - 


40 


Escherichia coli K12 


4237 


37 




Mycobacterium tuberculosis H37Rv 


3991 


37 




Bacillus halodurans C-125 


4066 


25 - 


35 


Streptococcus pneumoniae TIGR4 


2094 


30 - 


35 
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performed BLAST searches for all MP sequences against 
all HP sequences [35]. The following conservative cri- 
teria were used to identify putative orthologs: 

♦ Reciprocal best BLAST hits with the e-values in 
BLAST searches less than 10'^"; 

♦ The difference in the number of residues is less 
than 5% of the shorter sequence so that only small 
insertions/deletions were allowed; 

♦ Higher than 30% amino acid sequence identity. 

In addition, we removed transmembrane proteins, 
predicted by TMHMM 2.0 http://www.cbs.dtu.dk/ser- 
vices/TMHMM/, because they often use different stra- 
tegies from soluble proteins to survive under high 
temperature environments [36]. Furthermore, to 
reduce the statistical bias caused by redundancy, we 
clustered paralogues using the blastclust program 
available in the BLAST package [35]. The minimum 
length coverage of blastclust was set to 0.5 and the 
sequence similarity threshold was set to 0.25. 
Sequences longer than 600 or shorter than 50 residues 
were also removed. The final dataset consists of 540 
non-redundant HP-MP ortholog pairs. Pfam http:// 
pfam.sanger.ac.uk/ domain scans of these proteins con- 
firmed, as expected, that the two proteins of each 
ortholog pair contain the same domains. Thus the 
selected pairs are very likely true orthologs. 

We also used a set of 373 structurally well-aligned 
protein pairs from (hyper)thermophilic and mesophilic 
organisms compiled by Glyakina et al. for testing pur- 
pose [37]. The dataset includes 63 hyperthermophilic 
and 310 thermophilic proteins. 

Amino acid substitution matrix 

The amino acid residue substitution matrix was con- 
structed following an established procedure 
[11,12,22-24]. In brief, we counted each of the 380 types 
of amino acid residue substitutions in the BLAST 
sequence alignments of all MP/HP pairs. Substitutions 
in converting MPs to HPs are considered as the "for- 
ward" direction. Two-tail binomial statistics were used 
to estimate the statistical significance of the asymmetry 
of the forward and reverse substitutions of any given 
pair of amino acids [23]. 

Two sets of experimentally investigated protein 
mutations 

In addition to the protein pairs mentioned before, we used 
two independent datasets for additional testing. The first 
set contains two wild-type adenylate kinases (ADKs) from 
Methanococcus Voltae and Methanococcus Jannaschii, and 
a series of chimeric proteins generated from these two 
enzymes [38]. These proteins share significant sequence 



identity but differ in their thermostability (Table 2). The 
second dataset was collected by Montanucci and collea- 
gues [20]. It contains 10 wild type proteins and 14 mutants 
manifesting thermal stability changes (Table 3). All protein 
sequences in these two tests were subjected to BLAST 
searches against all sequences in the 540 ortholog pairs. 
Only one protein, BsCSP (GI: 16077975), showed greater 
than 25% similarity to one of the protein sequences in the 



Table 2 Two wild-type ADKs and a series of chimeric 
enzymes generated from these two enzymes[38]. 



Seq_ID Comm_meso 


Comm_hyp 


T„ (°C) 


Ranking 


MJA 0 


62 


103 


8 


V36J 9 


53 


98 


7 


J160V 9 


53 


96 


6 


JVJ 37 


25 


89 


4 


VJV 20 


42 


82.5 


5 


V160J 51 


11 


74 


2 


J36V 53 


9 


73 


3 


MVO 62 


0 


69 


1 


Comm_meso and comm_hyp are the counts of the identical residues in the MP 
sequence IWVO and the HP sequence IWJA, respeaively. The last column is the 
relative stability ranked by our scoring function (from least to most stable). 


Table 3 The ranking of relative thermostability of wild 
type proteins and their mutated sequences using the 
scoring function. 


Protein name 


length 


T„,(°C) 


Ranking 


Dmeh (GI: 640374) 


51 


49 


1 


Dmeh„UMC 


51 


99 


2 


Dmeh_UVF 


51 


99 


3 


BsCSP (GI: 16077975) 


67 


53.8 


1 


BsCSP_mt1 


67 


69.7 


2 


BsCSP_mt2 


67 


83.7 


3 


PhyA (GI: 464382) 


467 


55 


1 


PhyA_mt18 


467 


62 


2 


PhyA_mt24 


467 


62+ 


3 


PTDH (GI: 194552172) 


336 


39 


1 


PTDH_12x 


336 


59.7 


2 


PTDH_optl4 


336 


64.4 


3 


CbADH (GI: 187935035) 


351 


64.5 


1 


cbADH_Q100P 


351 


76 


2 


P-GUS (GI: 868020) 


602 


45 


1 


P-GUS_TR3337 


602 


65 


2 


FAOX (GI: 20302586) 


372 


37 


1 


FAOX_TE 


372 


45 


2 


Shble (GI: 3891709) 


121 


67.4 


1 


Shble_HTS 


121 


85.1 


2 


EcHPH (GI: 12539) 


341 


51 


1 


EcHPH_hph5 


341 


67 


2 


PDAO (GI: 129305) 


347 


45 


2 


PDAO_F42C 


347 


55 


1 



The data were originally collected by Montanucci et al. [20]. The sequence of 
cbADH was retrieved from the original literature by Goihberg et o/.[49]. 
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540 pairs. Thus, these proteins can be used as independent 
testing datasets. 

Features 

A set of 83 features derived from protein sequences was 
calculated using various software programs or in-house 
scripts (Table 4, more information about these feature is 
available in the additional file 1). These features can be 
roughly classified into two groups. The features in the 
first group, denoted as c^t, are the absolute counts of 
amino acid residues or other properties. The features in 
the second group, labeled as X/^, are the chain length 
normalized values of the features in the first group. 
Although including structure-based features may be of 
great help in understanding the mechanisms of muta- 
genesis induced protein stabilization, the vast majority 
of proteins lack solved structures. Therefore we only 
investigated the contributions from sequence-based fea- 
tures in this work. Furthermore, the theory that the 
sequence of a protein determines its structure suggests 
that the knowledge extracted from the sequence may be 
sufficient to distinguish proteins with different thermo- 
stability. Besides the general information extracted from 
the sequence, we also included several predicted features 
which were obtained by mature and widely-used algo- 
rithms, such as those used to predict secondary struc- 
ture [39] and exposed/buried residues [40] (Table 4). 

Random Forest 

The random forest algorithm is an ensemble technique 
that utilizes the results of hundreds or even thousands 
of decision trees to perform classification or regression 



[41]. Each of the member trees is built on a bootstrap 
sample from the training data and utilizes a random 
subset of available variables. The algorithm has been 
applied in broad classification tasks and has frequently 
demonstrated superior performance compared to other 
classification algorithms [42,43]. It is robust and particu- 
larly suitable for classifying high-dimensional and noisy 
data. One very useful feature of the algorithm is that it 
offers several methods to assess the importance of var- 
ious features based on their contributions to the correct- 
ness of the resulting classification [41]. In this study, we 
used the Gini importance to rank the importance of all 
used features. The Gini importance is the summation of 
the Gini impurity decreases in node splits made on the 
feature over all trees in the model. The Gini impurity is 
a common metric to measure the degree of impurity 
[44]. It is defined as: 

m 

I{A) = 1-^pI (1) 

k=i 

where k = (1, 2, ...,m) are possible classes and is the 
relative frequency of class in a node A. Therefore /(A) 
equals to zero when all cases in the node belong to a 
single class and reaches its maximum when cases are 
equally distributed to all classes. 

We used a random forest package implemented in the R 
environment for this study http://cran.r-project.org/web/ 
packages/randomForest/index.html. Random forest mod- 
els are usually insensitive to the model parameters [41]. 
Consequently the default parameters were used in the 
study. 



Table 4 The list of the 83 features used in the study. 



Protein feature 


Number of 
Features 


Source 


Sequence length (L) 


1 


In-house script 


Count and composition of amino acids 


40 


In-house script 


Number and percentage of positive, negative and all charged residues, as well as the net charges 


8 


In-house script 


Number and percentage of small (T and D), tiny (G, A, S and P), aromatic (F, H, Y, W), aliphatic, hydrophobic 

and polar residues 


12 


In-house script 


Number and percentage of residues which can form hydrogen bond in sidechain 


2 


In-house script 


Number of sulfide atoms 


1 


In-house script 


Average solubility of amino acids in aqueous solutions under room temperature 


1 


** 


The average of the maximum solvent accessible surface area (ASA) of each amino acid 


1 


Eisenhaber[50] 


Predicted isoelectric point (pi) of the protein, the average pi on all residues (pla) 


2 


ProtParam[51] 


Instability index and instability class 


2 




Aliphatic index 1 


Gravy hydropathy index 1 


Composition of the predicted secondary structure residues 


3 


Psipred[52] 


Predicted percentages of buried/exposed residues 


2 


Accpro[40] 


The overall length and percentage of all coils, rem465, and hotloop 


6 


disEMBL[53] 



'Obtained from The Merck Index, Merck & Co., Inc., Whitehouse Station, NJ 12 (1996). 
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Results and Discussion 

In this section, we first report a MP/HP residue substi- 
tution preference matrix generated from the BLAST 
pairwise alignments of MP and HP orthologs. Feature 
selection using the random forest algorithm is then 
described, followed by the scoring function construction. 
The performance of the scoring function in discriminat- 
ing hyperthermophilic and mesophilic proteins was esti- 
mated with a set of holdout testing dataset. Finally, the 
application of the scoring function in predicting relative 
stability of proteins and their mutants is presented. 

Amino acid composition 

The overall differences in amino acid composition 
between HPs and MPs are consistent with previous 
reports (Table 5) [7,11,22-24,45]. Based on the p-values 
from unpaired and paired t-test, the most significantly 
increased residues in HPs include Lys, Glu, Tyr, and He, 
while reduced residues include Gin, His, Ala, and Thr. 

The 540 HPs consist of 426 bacteria and 114 archaea 
proteins while all MPs are from bacteria. In order to 
rule out the possibility that the different domains cause 
bias toward residue composition and the final results, 
we calculated the correlation coefficients of the amino 
acid compositions between HPs from archaea and MPs, 
HPs from bacteria and MPs, and HPs from bacteria and 
HPs from archaea. The R values of the correlations are 
0.779, 0.828, and 0.968, respectively (Figure 1). 



Therefore, the composition difference possibly attributed 
to bacteria vs. archaea domains isn't as significant as the 
contributions by thermal adaptation. 

Amino acid residue substitutions 

All 380 residue substitutions are reported in Figure 2. We 
also calculated the ratio of each substitution to the oppo- 
site replacement. Substitutions with statistically signifi- 
cant bias (p < 10'^°) are shown in bold. Red cells are 
substitutions favored in the MP to HP direction while 
blues are favored in the opposite direction. There are 84 
(22%) significantly biased substitutions and 44 of them 
are in the direction from MP to HP. The overall trends of 
the substitution preferences are consistent with previous 
studies [22-24] . For example, charged residues, especially 
Lys and Glu, gain significantly in HPs at the cost of 
uncharged polar residues such as Ser, Gin, and Tyr. 

Many of the significant substitution asymmetries are 
consistent with various proposed protein thermo stabi- 
lity mechanisms. For example, Asp is preferred to be 
substituted by Glu or Lys in the direction from MP to 
HP, both are helix favored while Asp is coil favored. 
This is consistent with previous findings that in general 
HPs contain more helical regions at the cost of disor- 
dered regions than MPs [28,29]. There is a strong pre- 
ference for Ser, Thr, Asn and Gin to be substituted by 
Lys and Glu in HPs, which can be explained by the 
observed significant reduction of polar non-charged 



Table 5 Comparison of the composition of the amino acids in hyperthermophilic and mesophilic proteins and their 
significance p-values of t-test and paired t-test. 



Amino acid 


Composition in HP 


Composition in MP 


p-value (t-test) 


p-value (paired t-test) 


S 


0.044 ± 0.016 


0.050 ±0.015 


9.60x10"' 


5.61x10"*^ 


Q 


0.019 ± o.on 


0.037 ± 0.015 


6.81x10""^ 


1.24x10'*"* 


N 


0.035 ± 0.014 


0.035 ±0.015 


0.88 


0.85 


T 


0.042 ± 0.014 


0.055 ± 0.016 


1.02x10"" 


2.44x10"^** 


C 


0.009 ± 0.01 1 


0.010 ± 0.011 


0.36 


0.08 


G 


0.075 ± 0.019 


0.079 ± 0.020 


9.14x10"* 


9.68x10'*° 


A 


0.066 ± 0.023 


0.080 ± 0.028 


3.41x10"^ 


1.08x10'''' 


H 


0.017 ± 0.010 


0.024 ±0.013 


3.00x10"^° 


4.64x10"^ 


M 


0.024 ± 0.01 1 


0.026 ±0.010 


0.02 


3.00x10'^ 


Y 


0.033 ± 0.014 


0.027 ±0.013 


610x10"*^ 


4.47x10'^* 


F 


0.038 ± 0.015 


0.033 ±0.014 


3.00x10"^ 


1.29x10'*^* 


V 


0.086 ± 0.021 


0.082 ± 0.020 


2.32x10"* 


5.36x10'' 


L 


0.089 ± 0.021 


0.089 ± 0.022 


0.73 


0.59 


P 


0.041 ± 0.015 


0.040 ±0.014 


0.39 


0.16 


1 


0.077 ± 0.020 


0.066 ±0.019 


6.46x10'^° 


3.15x10'^'' 


W 


0.008 ± 0.007 


0.007 ± 0.007 


0.05 


3.00x10'^ 


D 


0.050 ±0.015 


0.057 ± 0.016 


1.92x10"*^ 


5.64x10'^^ 


E 


0.097 ± 0.023 


0.079 ± 0.022 


673x10"^^ 


5.63x10''= 


K 


0.091 ± 0.023 


0.060 ± 0.023 


121x10""' 


5.25x10'**' 


R 


0.056 ± 0.023 


0.055 ± 0.023 


0.57 


0.36 
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Figure 1 The pariwise comparisons of amino acid compositions 
in the three different sets of proteins The solid lines show the 
best-fit of linear regression lines with regression coefficient and 
slope displayed and the dash lines show the orthogonal line. 
Bac_M, arc_H and bac_H are proteins from mesophilic bacteria, 
hyperthermophilic archaea and bacteria, respectively. 



differentially. In this section, we describe a procedure 
for ranking the importance of all 83 features derived 
from protein sequences in discriminating MPs and HPs 
using the random forest algorithm. 

A standard five-fold cross validation procedure was 
used to determine the importance of features and 
develop the scoring function. We randomly split the 540 
pairs into five equal portions. We used four portions as 
training datasets and reserved the remaining portion for 
testing purposes. We then constructed a random forest 
model with 3000 trees for discriminating these 432 
ortholog protein pairs in the training set and then used 
the Gini importance to rank these features. The proce- 
dure was repeated four more times and each time a dif- 
ferent portion was used as the testing dataset. We found 
that the results from all five runs were very consistent. 
All features were ranked by their average importance 
and top 25 are shown in Figure 3. The levels of gluta- 
mine and lysine are most important among the 83 fea- 
tures used in this study, followed by the percentage of 
positively charged residues. We also noticed that the 
features normalized by sequence length are consistently 
more important than the corresponding absolute counts. 
Thus, we only used the normalized features in the scor- 
ing function. Interestingly, all predicted features, such as 
secondary structure, the ratio of exposed to buried resi- 
dues, and the disordered region predictions, failed to 
appear in the 25 most important features. 



residues [23,26] and deamidation vulnerable residues 
[33,34] in HPs. Leu is preferred to be substituted by He 
to enhance thermo stability. This is consistent with the 
finding that increasing P -branched amino acids in loop 
regions enhance protein thermostability [1,27]. 

It is worth mentioning that the significance threshold 
(p < 10'^°, Fisher's exact test [46]) used in this study 
was significantly more stringent than the criteria used in 
previous studies (e.g. p < 10'^) because we used approxi- 
mately five times as many HP/MP pairs as previous stu- 
dies. The ratios of forward-to-reverse changes for these 
substitutions were also calculated based on more exam- 
ples than in previous studies. For example, the matrix 
reported by Haney et al. contained 72 residue replace- 
ments with no or only single instances [23]. In our 
matrix, the minimum number is 3 and there are only 14 
substitutions with less than 10 examples. Therefore the 
ratios in this matrix may better reflect thermal adapta- 
tion induced substitution biases and should be useful in 
designing thermostable proteins. 

Ranking features using a random forest algorithm 

The analysis of the residue substitution preference 
between MPs and HPs clearly indicates that different 
residues contribute to protein thermostability 



Developing the scoring function 

We first calculated the relative feature difference Axj, 
which is defined as: 



X;(seql)-X;(seq2) 
x,(seql)+xj(seq2) 



(2) 



where Xi{&e.<\l) and ;v;,(seq2) are the values of the ith 
feature from the first sequence and the second 
sequence, respectively. We plotted the cumulative curves 
of the relative feature difference ISjCi of the ten most sig- 
nificant normalized features in the training dataset (Fig- 
ure 4). In this plot, all cumulative curves show typical 
sigmoid shapes in which the inflexion points are located 
in the curve at the half height, i.e., the cumulative 
counts are equal to half of the total counts. 

We constructed the scoring function by a linear com- 
bination of the ten most important features. The scoring 
function can be written as: 



Score(seql,seq2) = ^ WjAXj 



(3) 



where / runs over all 10 features are used in the scor- 
ing function and vi/, is the weight for each feature. The 
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Figure 2 Amino acid substitutions between mesophilic and hyperthermophilic proteins. The top number in each cell is the observed 
substitution instances and the bottom one (in italics) is the ratio of the number of the substitution cases to the opposite substitution. Significant 
biased substitutions (p-value < 10"^'', two-sided Fisher's exact test) are highlighted in bold. Red cells are significant HP favored substitutions while 
blues are IVIP favored. 



sign of the weight of each feature was determined by the 
location of the inflexion point of its cumulative curve: 
positive for features located to the left and negative for 
those to the right of the zero-difference line. Thus the 
signs of x_K, x_E, xjpos, ;«_charge, and ASA are posi- 
tive, and negative for ;«_small, ;«_tiny, ;>c_A, x_Q and 
x_T. We then used a hill-climbing algorithm to fit the 
weights of these features. The absolute values of all 
weights were restricted to the range of 0 to 1. We ran- 
domly assigned an initial weight to each feature and 
counted the number of correctly ranked ortholog pairs. 



The weights were then randomly updated and the num- 
ber of correct ranks was recounted. The new weights 
were kept if they resulted in more correctly ranked 
ortholog pairs; otherwise the weights were rolled back 
to the previous values. This procedure was repeated 5 x 
10^ times and the batch of weights which maximized 
the number of positive score values was recorded. To 
check whether the optimization procedure was trapped 
in a local maximum, we repeated the procedure four 
more times using different random seeds. The results 
were very similar and thus we simply used the average 
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Figure 3 The 25 most important features ranl<ed by the Gin! importance of the random forest algorithm The prefixes c_ and x_ of each 
feature indicate that the feature is an absolute count or normalized value, respectively. 




of the weights in the scoring function. We then used the 
same procedure to develop four more scoring functions, 
each for one of remaining training datasets. 

The discrimination ability of the scoring function 

We calculated the accuracies of the discriminations 
made by the five scoring functions on their correspond- 
ing training datasets. The scoring functions using 



optimized weights were able to distinguish in average 
427.1 ± 1.9 out of 432 (98.9% accuracy) ortholog protein 
pairs in the training datasets. We then tested each of the 
scoring functions with its corresponding holdout testing 
dataset. Out of 108 protein pairs in the testing sets, on 
average 105.1 ± 0.5 pairs were correctly ranked (97.3% 
accuracy). This was very close to the accuracy obtained 
from the training sets (98.9%). Thus the scoring function 
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Table 6 The final weights of the ten features used in the scoring function. 



Feature 


x_K 


x_E 


x_pos 


x_charge 


ASA 


x_small 


x_tiny 


x_A 


x_Q 


x_T 


Weight 


0.75 


0.20 


0.80 


0.20 


0.90 


-0.20 


-0.20 


-0.30 


-0.10 


-0.20 



is robust and able to discriminate a broad spectrum of 
HP and MP homologous protein pairs. 

The average weights for each feature determined in all 
five training procedures are quite consistent. Thus we 
simply use the averages of these weights in the final 
scoring function (Table 6). It is noteworthy that while 
the signs of the weights indicate whether the features 
are favorable or not in hyperthermophilic proteins, their 
absolute values are not significant since the features are 
not normalized to a common scale. 

We also applied the scoring function to discriminating 
(hyper)thermophilic and mesophilic proteins in the 
Glyakina dataset [37]. Our scoring function was able to 
correctly discriminate not only 59 HP/MP pairs (93.7% 
accuracy), but also 238 thermophilic and mesophilic 
pairs (76.8% accuracy). The list of these proteins and 
their scores are provided in Table S2 in the additional 
file 1. We believe that the difference of the accuracy 
between hyperthermophilic and thermophilic proteins 
was caused by the different stabilization mechanisms of 
hyperthermophilic and thermophilic proteins, as pre- 
viously suggested in literature [17,31]. 

Discriminating non-homologous protein pairs 

Encouraged by the results in the above test, we further 
challenged the scoring function in discriminating non- 
homologous HP/MP protein pairs. In this test, we com- 
pared each HP protein sequence against all MP 
sequences. The overall accuracy of these 540*540 pair- 
wise comparisons was 88.4%. Such a high accuracy in 
discriminating non-homologous HP and MP sequences 
confirms that HP sequences share some common 
sequential patterns to generate sufficient stability at ele- 
vated temperature. 

Application in ranking the thermostability of proteins and 
their mutants 

The first test was carried out on two wild-type ADKs 
and a series of chimeric enzymes generated from these 
two enzymes [38]. The predicted ranking of thermo-sta- 
bility using the scoring function is highly consistent 
with the experimental results (Table 2). In all 28 (cf ) 
pairwise comparisons, only two resulted in incorrect 
predictions (92.9% accuracy). Moreover, the two inaccu- 
rate predictions included one between VJV and JVJ in 
which the Tm differed by only 6.5°C, and the other 
between V160J and J36V in which the Tm differed by 
just 1°C, probably not an experimentally detectable 
difference. 



In the second test, we used a batch of sequences col- 
lected by Montanucci, et al [20]. The sequence lengths, 
the GI numbers of the wild-type proteins, and their 
melting temperatures are listed in Table 3. We used the 
scoring function to rank the relative thermostability of 
wild-type proteins and their mutants. In the case of pro- 
teins with two mutants, the relative stability of these 
mutants was also predicted. Overall there were 18 pair- 
wise comparisons between these wild proteins and their 
mutants. The scoring function achieved an accuracy of 
94.4% (17/18). The wrong prediction was for protein 
PDAO and its mutant (Table 3). It is a single mutation 
(F42C) and the difference in Tm is moderate (10°C). 

Overall, the scoring function has consistently demon- 
strated a remarkable ability to rank the relative thermo- 
stability of proteins and their mutants. Thus a website 
http://www.abl.ku.edu/thermorank/ was created and 
made freely available to the general public. 

Comparison with other Methods 

The current study differs at the level of information 
granules from previous work focused amino acid com- 
position differences between thermophilic and mesophi- 
lic organisms [7,14,18]. We focused on the differences 
between HP and MP ortholog pairs instead of on the 
differences between thermophilic or mesophilic proteins 
at the genome level. The difference between these two 
approaches is similar to the one between unpaired and 
paired two-sample i-tests. While previous studies have 
succeeded in revealing the overall changes caused by 
thermal adaptation at the genome level, our study has 
further focused on the protein level. Such an approach 
may reduce or eliminate the effects of confounding fac- 
tors such as protein families because it is well estab- 
lished that the amino acid composition may vary in 
different protein classes [47]. In addition, a protein level 
study may be more relevant to designing stable proteins 
because orthologs are essentially mutants with multiple 
mutations. 

To compare the performance of our algorithm to 
other approaches is difficult because very few algorithms 
have been developed to rank the relative thermostability 
of HP/MP orthologous pairs and these studies have 
used different datasets [20,48]. TargetStar, a scoring 
function based on the analysis of 1006 decoy structures 
for a given protein, can discriminate HP/MP orthologs 
pairs with 77% accuracy [48]. Recently, Montanucci and 
colleagues reported a SVM model which achieves 88% 
accuracy on a set of redundancy-reduced HP/MP pairs 
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[20]. The SVM model used residue and dipeptide com- 
positions as predictive features. Thus, the 97.3% predic- 
tive accuracy on the test dataset of our scoring function 
is considerably higher than the reported accuracies of 
both previous methods. Moreover, in the application of 
predicting the relative thermostability of proteins and 
their mutants, our approach achieved an accuracy of 
94.4% (17/18) in the second blind test set, which repre- 
sents one more correct prediction than Montanucci et 
al. on the same dataset [20]. 

Conclusions 

We have presented a novel scoring function which can 
distinguish not only HP/MP ortholog pairs, but also 
non-homologous pairs at high accuracies. Most impor- 
tantly, it can be used to accurately predict the relative 
stability of proteins and their mutants, as demonstrated 
in two blind tests. In addition, the residue substitution 
preference matrix assembled in this study may better 
reflect the thermal adaptation induced substitution 
biases than previous studies because a larger dataset was 
used. The large set of HP/MP is available in the supple- 
mentary website and should be useful to other research- 
ers for further development of novel algorithms in this 
area. 



Abbreviations 
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